2023-11-30
here() packagehere() packagehere() packageThe here package creates paths relative to the top-level directory. The package displays the top-level of the current project on load or any time you call here():
here() packageHow can you avoid setwd() at the top of every script?
Suppose we have the following small data set, where 77 = “Refused” and 99 = “No Response” or some other term that we want to think of as “missing”.
var1 <- c(20, 22, 35, 19, 77, 99)
var2 <- c(1, 3, 4, 77, 6, 99)
var3 <- c("Yes", "No", 77, 99, "No", "Yes")
dat <- tibble(var1, var2, var3) |> mutate(var3 = factor(var3))
miss_var_summary(dat)# A tibble: 3 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 var1 0 0
2 var2 0 0
3 var3 0 0
How can we convince R the 77s and 99s are missing values?
replace_with_na() from naniardat1 <- dat |>
replace_with_na(
replace = list(var1 = c(77, 99), var2 = c(77, 99),
var3 = c(77, 99)))
miss_var_summary(dat1)# A tibble: 3 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 var1 2 33.3
2 var2 2 33.3
3 var3 2 33.3
More on replace_with_na() here
# A tibble: 3 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 var1 2 33.3
2 var2 2 33.3
3 var3 2 33.3
Other ways to extend replace_with_na() are described here
The replace_na() function from the tidyr package (details here) replaces an NA value with a specified value.
In that sense, it is the compliment to the replace_with_na() function.
# A tibble: 3 × 2
x y
<dbl> <chr>
1 1 a
2 2 <NA>
3 NA b
# A tibble: 3 × 2
x y
<dbl> <chr>
1 1 a
2 2 unknown
3 0 b
More on replace_na() here
Suppose we wish to investigate the association between retirement status and heart disease. One concern might be the age of the subjects: an older person is both more likely to be retired and more likely to have heart disease. In one study, therefore, 127 victims of cardiac arrest were matched on a number of characteristics that included age with 127 healthy control subjects: retirement status was then ascertained for each subject, with the results shown on the next slide.
| – | Retired | Not Retired | ALL |
|---|---|---|---|
| Healthy Control | 39 | 88 | 127 |
| Cardiac Arrest | 47 | 80 | 127 |
| ALL | 86 | 168 | 254 |
We might think to use our usual twobytwo() function, but each matched pair in this study provides two responses, one for the Healthy Control and one for the Cardiac Arrest.
Note
CA = Cardiac Arrest, HC = Healthy Control
| – | CA, Retired | CA, Not Retired | Total |
|---|---|---|---|
| HC, Retired | 27 | 12 | 39 |
| HC, Not Retired | 20 | 68 | 88 |
| Total | 47 | 80 | 127 |
The relevant data are stored for you in the retire.csv file.
retire tibbleretire <- read_csv(here("c24", "data", "retire.csv")) |>
mutate(healthy_control =
fct_relevel(factor(healthy_control), "Retired", "Not Retired"),
cardiac_arrest =
fct_relevel(factor(cardiac_arrest), "Retired", "Not Retired"))
retire |> tabyl(healthy_control, cardiac_arrest) |>
adorn_totals(where = c("row", "col")) |> adorn_title() cardiac_arrest
healthy_control Retired Not Retired Total
Retired 27 12 39
Not Retired 20 68 88
Total 47 80 127
| – | CA, Retired | CA, Not Retired |
|---|---|---|
| HC, Retired | 27 | 12 |
| HC, Not Retired | 20 | 68 |
\(H_0\): There are equal numbers of pairs in which the healthy control retired and the matched individual who had a cardiac arrest did not retire, and in which the healthy control did not retire but the matched individual who had a cardiac arrest did retire.
| – | CA, Retired | CA, Not Retired |
|---|---|---|
| HC, Retired | 27 | 12 |
| HC, Not Retired | 20 | 68 |
The concordant pairs - or the pairs when two retired people or two non-retired people are matched - provide no information for testing a null hypothesis about differences in retirement status.
Thus, we focus only on the discordant pairs - or the pairs where a person who retires is paired with a person who has not retired.
| – | CA, Retired | CA, Not Retired |
|---|---|---|
| HC, Retired | 27 | 12 |
| HC, Not Retired | 20 | 68 |
The McNemar odds ratio is just the ratio of the two discordant pairs, here: 12/20 = 0.60, which indicates the association’s strength – the odds of retiring for healthy controls are estimated to be 0.6 times as high as the odds of retiring for those with cardiac arrest.
\(H_0\): No association between retirement status and cardiac arrest
McNemar's Chi-squared test with continuity correction
data: table(retire$healthy_control, retire$cardiac_arrest)
McNemar's chi-squared = 1.5312, df = 1, p-value = 0.2159
Without continuity correction, and tidied…
epibasix package (includes odds ratio)The mcNemar() function comes from the epibasix package.
Matched Pairs Analysis: McNemar's Chi^2 Statistic and Odds Ratio
McNemar's Chi^2 Statistic (corrected for continuity) = 1.531 which has a p-value of: 0.216
McNemar's Odds Ratio (b/c): 0.6
90% Confidence Limits for the OR are: [0.28, 1.134]
McNemar’s test is typically valid when the number of discordant pairs exceeds 30.
exact2x2() (also includes odds ratio)exact2x2 package can also do this, using a slightly different approximation for the confidence interval.exact2x2(retire$healthy_control, retire$cardiac_arrest, paired = TRUE,
conf.int = TRUE, conf.level = 0.90)
Exact McNemar test (with central confidence intervals)
data: retire$healthy_control and retire$cardiac_arrest
b = 12, c = 20, p-value = 0.2153
alternative hypothesis: true odds ratio is not equal to 1
90 percent confidence interval:
0.3031389 1.1534984
sample estimates:
odds ratio
0.6
In a double-blind trial, patients with active rheumatoid arthritis will be randomly assigned to receive one of two therapy types: a cheaper one, or a pricier one.
The proposed primary analysis will estimate the difference in the proportion of participants who had a DAS28 of 3.2 or less at week 48. The DAS28 is a composite index of the number of swollen and tender joints, the erythrocyte sedimentation rate, and a visual-analogue scale of patient-reported disease activity.
The study’s power analysis established a sample size target of 225 completed enrollments in each therapy group, based on a two-sided 10% significance level, and a desire for 90% power, assuming the proportion of participants with a DAS28 of 3.2 or less at week 48 was 0.27 under the less effective therapy.
What value was used in the power calculation for the proportion of participants with DAS28 of 3.2 or less at week 48 for the more effective therapy? Round your response to two decimal places.
Two-sample comparison of proportions power calculation
n = 225
p1 = 0.27
p2 = 0.3996866
sig.level = 0.1
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
Rounding to two decimal places, p2 was 0.40
An investigator plans to replicate part of a study of the gut hormone fragment peptide \(YY_{3-36}\) (PYY) which reduces appetite and food intake when infused into subjects of normal weight. The original study is found here.
In common with the adipocyte hormone leptin, PYY reduces food intake by modulating appetite circuits in the hypothalamus. However, in obesity there is a marked resistance to the action of leptin, which greatly limits its therapeutic effectiveness.
The investigator wants to know whether obese subjects are also resistant to the anorectic effects of PYY. She intends to perform a randomized, placebo-controlled, double-blind crossover study on healthy obese subjects (including an equal number of male subjects and female subjects), with each subject studied on two occasions one week apart.
The subjects will be screened by a dietitian who will assess their eating behavior with (several established scales).
On two consecutive Thursdays, we will measure caloric intake during a buffet lunch offered two hours after the infusion of that week’s exposure. In one of the weeks, the subject will receive an infusion of PYY, and in the other week (with the order of the weeks determined at random) the subject will receive a placebo. The number of calories consumed at each lunch is measured and then converted to an appetite rating. Our primary outcome is the difference between the appetite rating after PYY and the appetite rating after placebo.
A clinically meaningful difference, the investigator tells you, would be one in which these comparisons would differ by 30 or more points on the appetite rating scale comparing the two infusions, which is 60% of the anticipated standard deviation of these results. The investigator then asks whether a study which gathers a total of 64 observations will yield at least 90% power to detect an effect of this size using a 5% two-tailed significance level, and to meet all other requirements described above.
Paired t test power calculation
n = 32
delta = 30
sd = 50
sig.level = 0.05
power = 0.9077895
alternative = two.sided
NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs
What is the proper conclusion? Do we have at least 90% power?
The Role of Statistics in Data Science and Artificial Intelligence (pdf): 2023-08-04
Statistics plays a central role in data science and AI, especially in the areas of ML and deep learning. Framing questions statistically allows leveraging data resources to extract knowledge and obtain better answers. The central dogma of statistical inference, that there is a component of randomness in data, enables researchers to formulate questions in terms of underlying processes, quantify uncertainty in their answers, and separate signal from noise. A statistical framework allows researchers to distinguish between causation and correlation, and thus to identify interventions that will cause changes in outcomes.
The Role of Statistics in Data Science and Artificial Intelligence (pdf): 2023-08-04
The future of data science and AI is uncertain, but one thing is clear: the field will undoubtedly continue to evolve rapidly and have a profound impact on society. As a result, the role of statisticians will also be subject to change and expansion, as they must adapt to new technologies and tools while continuing to provide expertise in traditional areas of statistics such as uncertainty quantification, sampling design, and causal inference. The need for interdisciplinary collaboration and a diverse range of skills will become increasingly important for statisticians to remain relevant in this dynamic and ever-changing field.
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Locale:
LC_COLLATE=English_United States.utf8
LC_CTYPE=English_United States.utf8
LC_MONETARY=English_United States.utf8
LC_NUMERIC=C
LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
Package version:
askpass_1.2.0 backports_1.4.1 base64enc_0.1.3
bigD_0.2.0 bit_4.0.5 bit64_4.0.5
bitops_1.0.7 blob_1.2.4 brio_1.1.3
broom_1.0.5 bslib_0.5.1 cachem_1.0.8
callr_3.7.3 cellranger_1.1.0 cli_3.6.1
clipr_0.8.0 colorspace_2.1-0 commonmark_1.9.0
compiler_4.3.1 conflicted_1.2.0 cpp11_0.4.6
crayon_1.5.2 curl_5.1.0 data.table_1.14.8
DBI_1.1.3 dbplyr_2.4.0 desc_1.4.2
diffobj_0.3.5 digest_0.6.33 dplyr_1.1.3
dtplyr_1.3.1 ellipsis_0.3.2 epibasix_1.5
evaluate_0.22 exact2x2_1.6.8 exactci_1.4-4
fansi_1.0.5 farver_2.1.1 fastmap_1.1.1
fontawesome_0.5.2 forcats_1.0.0 fs_1.6.3
gargle_1.5.2 generics_0.1.3 ggplot2_3.4.4
glue_1.6.2 googledrive_2.1.1 googlesheets4_1.1.1
graphics_4.3.1 grDevices_4.3.1 grid_4.3.1
gridExtra_2.3 gt_0.10.0 gtable_0.3.4
haven_2.5.3 here_1.0.1 highr_0.10
hms_1.1.3 htmltools_0.5.6.1 htmlwidgets_1.6.2
httr_1.4.7 ids_1.0.1 isoband_0.2.7
janitor_2.2.0 jquerylib_0.1.4 jsonlite_1.8.7
juicyjuice_0.1.0 knitr_1.44 labeling_0.4.3
lattice_0.21.8 lifecycle_1.0.3 lubridate_1.9.3
magrittr_2.0.3 markdown_1.11 MASS_7.3.60
Matrix_1.6.1.1 memoise_2.0.1 methods_4.3.1
mgcv_1.8.42 mime_0.12 modelr_0.1.11
munsell_0.5.0 naniar_1.0.0 nlme_3.1.162
norm_1.0.11.1 openssl_2.1.1 parallel_4.3.1
pillar_1.9.0 pkgbuild_1.4.2 pkgconfig_2.0.3
pkgload_1.3.3 plyr_1.8.9 praise_1.0.0
prettyunits_1.2.0 processx_3.8.2 progress_1.2.2
ps_1.7.5 purrr_1.0.2 R6_2.5.1
ragg_1.2.6 rappdirs_0.3.3 RColorBrewer_1.1.3
Rcpp_1.0.11 reactable_0.4.4 reactR_0.5.0
readr_2.1.4 readxl_1.4.3 rematch_2.0.0
rematch2_2.1.2 reprex_2.0.2 rlang_1.1.1
rmarkdown_2.25 rprojroot_2.0.3 rstudioapi_0.15.0
rvest_1.0.3 sass_0.4.7 scales_1.2.1
selectr_0.4.2 snakecase_0.11.1 splines_4.3.1
ssanv_1.1 stats_4.3.1 stringi_1.7.12
stringr_1.5.0 sys_3.4.2 systemfonts_1.0.5
testthat_3.2.0 textshaping_0.3.7 tibble_3.2.1
tidyr_1.3.0 tidyselect_1.2.0 tidyverse_2.0.0
timechange_0.2.0 tinytex_0.48 tools_4.3.1
tzdb_0.4.0 UpSetR_1.4.0 utf8_1.2.3
utils_4.3.1 uuid_1.1.1 V8_4.4.0
vctrs_0.6.4 viridis_0.6.4 viridisLite_0.4.2
visdat_0.6.0 vroom_1.6.4 waldo_0.5.1
withr_2.5.1 xfun_0.40 xml2_1.3.5
yaml_2.3.7
431 Class 24 | 2023-11-30 | https://thomaselove.github.io/431-2023/